GDAS015-NGS的可视化

NGS的可视化

我们将用以下的包来说明一下如何对NGS数据进行可视化。

#biocLite("pasillaBamSubset")
#biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
library(pasillaBamSubset)
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
fl1 <- untreated1_chr4()
fl2 <- untreated3_chr4()

我们将会采用四种方式来查看NGS的数据，分别是IGV，plot,Gviz和ggbio。

IGV

将这些文献从R library的目录中复制到当前的工作目标。首先将工作目录设置为源文件位置。我们需要使用Rsamtools包来对BAM文件加上索引，然后使用IGV来查看，如下所示：

file.copy(from=fl1,to=basename(fl1))
## [1] FALSE
file.copy(from=fl2,to=basename(fl2))
## [1] FALSE
library(Rsamtools)
## Loading required package: Biostrings
## Loading required package: XVector
indexBam(basename(fl1))
##       untreated1_chr4.bam 
## "untreated1_chr4.bam.bai"
indexBam(basename(fl2))
##       untreated3_chr4.bam 
## "untreated3_chr4.bam.bai"

IGV可以在这个网站下载: https://www.broadinstitute.org/igv/home

下载的时候需要提供你的电子邮箱，现在我们使用IGV来查看一个基因，例如igs。

如果无法下载IGV，那么可以使用UCSC的 Genome Browser：http://genome.ucsc.edu/cgi-bin/hgTracks

UCSC Genome Browser是一个很好的资源，有许多涉及基因注释，多个物种的保护，以及ENCODE表观遗传信号。但是，UCSC Genome Browser要求你将基因组文件上传到他们的服务器，或将数据放在公共可用服务器上。如果你处理的数据涉及保密，那就另当别论了。

Simple plot

现在我们来看一使用plot()函数来绘制基因。

1	library(GenomicRanges)

注意：如果你使用的是Bioconductor 14，它与R 3.1配合使用，需要加载这个包。如想是Bioconductor 13（它与R 3.0.x配合使用），则不需要加载这个包。

1 2	library(GenomicAlignments) ## Loading required package: SummarizedExperiment

我们从文件fl1中读取比对结果(alignments)。然后我们使用coverage函数来计算碱基对的覆盖。然后我们提取与我们感兴趣的基因重叠的覆盖区子集，并将此覆盖率从RleList转换为numeric向量。我们前面提到，Rle对象是一种压缩对象，例如重复数字会被压缩为2个数字，其中一个是重复的数字，另外一个是表示这个数字重复的次数。

x <- readGAlignments(fl1)
xcov <- coverage(x)
z <- GRanges("chr4",IRanges(456500,466000))
# Bioconductor 2.14
xcov[z]
## RleList of length 1
## $$chr4
## integer-Rle of length 9501 with 1775 runs
##   Lengths: 1252   10   52    4    7    2 ...   10    7   12  392   75 1041
##   Values :    0    2    3    4    5    6 ...    3    2    1    0    1    0
# Bioconductor 2.13
xcov$chr4[ranges(z)]
## integer-Rle of length 9501 with 1775 runs
##   Lengths: 1252   10   52    4    7    2 ...   10    7   12  392   75 1041
##   Values :    0    2    3    4    5    6 ...    3    2    1    0    1    0
xnum <- as.numeric(xcov$chr4[ranges(z)])
plot(xnum)

plot of chunk unnamed-chunk-5

现在我们读取另外一个文件fl2，如下所示：

y <- readGAlignmentPairs(fl2)
## Warning in .make_GAlignmentPairs_from_GAlignments(gal, strandMode = strandMode, : 63 pairs (0 proper, 63 not proper) were dropped because the seqname
##   or strand of the alignments in the pair were not concordant.
##   Note that a GAlignmentPairs object can only hold concordant pairs at the
##   moment, that is, pairs where the 2 alignments are on the opposite strands
##   of the same reference sequence.
ycov <- coverage(y)
ynum <- as.numeric(ycov$chr4[ranges(z)])
plot(xnum, type="l", col="blue", lwd=2)
lines(ynum, col="red", lwd=2)

plot of chunk unnamed-chunk-6

我们可以单独放大一个外显子，如下所示：

1 2	plot(xnum, type="l", col="blue", lwd=2, xlim=c(6200,6600)) lines(ynum, col="red", lwd=2)

plot of chunk unnamed-chunk-7

使用转录本数据库提取感兴趣的基因

假设我们想可视化基因igs。我们可以从转录本数据库TxDb.Dmelanogaster.UCSC.dm3.ensGene 中提取它，但我们首先要知道这个基因的Ensembl基因名，现在我们使用前面学到的函数来实现这一功能，如下所示：

# biocLite("biomaRt")
library(biomaRt)
m <- useMart("ensembl", dataset = "dmelanogaster_gene_ensembl")
lf <- listFilters(m)
lf[grep("name", lf$description, ignore.case=TRUE),]
##                             name
## 1                chromosome_name
## 17         with_flybasename_gene
## 19   with_flybasename_transcript
## 23  with_flybasename_translation
## 52            external_gene_name
## 66              flybasename_gene
## 67        flybasename_transcript
## 68       flybasename_translation
## 93                 wikigene_name
## 107               go_parent_name
## 208               so_parent_name
##                                      description
## 1                                Chromosome name
## 17                   with FlyBaseName gene ID(s)
## 19             with FlyBaseName transcript ID(s)
## 23                with FlyBaseName protein ID(s)
## 52          Associated Gene Name(s) [e.g. BRCA2]
## 66           FlyBaseName Gene ID(s) [e.g. cul-2]
## 67  FlyBaseName Transcript ID(s) [e.g. cul-2-RB]
## 68     FlyBaseName Protein ID(s) [e.g. cul-2-PB]
## 93                 WikiGene Name(s) [e.g. Ir21a]
## 107                             Parent term name
## 208                             Parent term name
map <- getBM(mart = m,
  attributes = c("ensembl_gene_id", "flybasename_gene"),
  filters = "flybasename_gene", 
  values = "lgs")
map
##   ensembl_gene_id flybasename_gene
## 1     FBgn0039907              lgs

现在我们提取每个基因的外显子，然后提取igs基因的外显子，如下所示：

library(GenomicFeatures)
grl <- exonsBy(TxDb.Dmelanogaster.UCSC.dm3.ensGene, by="gene")
gene <- grl[[map$ensembl_gene_id[1]]]
gene
## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames           ranges strand |   exon_id   exon_name
##          <Rle>        <IRanges>  <Rle> | <integer> <character>
##   [1]     chr4 [457583, 459544]      - |     63350        <NA>
##   [2]     chr4 [459601, 459791]      - |     63351        <NA>
##   [3]     chr4 [460074, 462077]      - |     63352        <NA>
##   [4]     chr4 [462806, 463015]      - |     63353        <NA>
##   [5]     chr4 [463490, 463780]      - |     63354        <NA>
##   [6]     chr4 [463839, 464533]      - |     63355        <NA>
##   -------
##   seqinfo: 15 sequences (1 circular) from dm3 genome

最后，我们绘制出这些外显子的范围，如下所示：

rg <- range(gene)
plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
       end(gene),rep(0,length(gene)),
       lwd=3, length=.1)

plot of chunk unnamed-chunk-10

不过实际上这个基因位于负链，因此我们需要改一下箭头的方向，如下所示：

plot(c(start(rg), end(rg)), c(0,0), type="n", xlab=seqnames(gene)[1], ylab="")
arrows(start(gene),rep(0,length(gene)),
       end(gene),rep(0,length(gene)),
       lwd=3, length=.1, 
       code=ifelse(as.character(strand(gene)[1]) == "+", 2, 1))

plot of chunk unnamed-chunk-11

Gviz

我们将简要展示两个在Bioconductor中可视化基因组数据的软件包。请注意，这两个包中的任何一个都可以用于绘制多种数据的图形。我们将在这里展示如何像以前一样制作覆盖图(converage plot)：

#biocLite("Gviz")
library(Gviz)
## Loading required package: grid
gtrack <- GenomeAxisTrack()
atrack <- AnnotationTrack(gene, name = "Gene Model")
plotTracks(list(gtrack, atrack))

plot of chunk unnamed-chunk-12

提取覆盖。Gviz中的数据最好是是GRanges对象，因此我们需要将RleList转换为GRanges对象，如下所示：

xgr <- as(xcov, "GRanges")
ygr <- as(ycov, "GRanges")
dtrack1 <- DataTrack(xgr[xgr %over% z], name = "sample 1")
dtrack2 <- DataTrack(ygr[ygr %over% z], name = "sample 2")
plotTracks(list(gtrack, atrack, dtrack1, dtrack2))

plot of chunk unnamed-chunk-13

1	plotTracks(list(gtrack, atrack, dtrack1, dtrack2), type="polygon")

plot of chunk unnamed-chunk-13

ggbio

#biocLite("ggbio")
library(ggbio)
## Loading required package: ggplot2
## Warning: replacing previous import by 'BiocGenerics::Position' when loading
## 'ggbio'
## Need specific help about ggbio? try mailing 
##  the maintainer or visit http://tengfei.github.com/ggbio/
## 
## Attaching package: 'ggbio'
## The following objects are masked from 'package:ggplot2':
## 
##     geom_bar, geom_rect, geom_segment, ggsave, stat_bin,
##     stat_identity, xlim
autoplot(gene)

plot of chunk unnamed-chunk-14

autoplot(fl1, which=z)
## reading in as Bamfile
## Parsing raw coverage...
## Read GAlignments from BamFile...
## extracting information...

plot of chunk unnamed-chunk-14

autoplot(fl2, which=z)
## reading in as Bamfile
## Parsing raw coverage...
## Read GAlignments from BamFile...
## extracting information...

plot of chunk unnamed-chunk-14

参考资料

Visualizing NGS data
IGV: https://www.broadinstitute.org/igv/home
Gviz: http://www.bioconductor.org/packages/release/bioc/html/Gviz.html
ggbio:http://www.bioconductor.org/packages/release/bioc/html/ggbio.html
UCSC Genome Browser: zooms and scrolls over chromosomes, showing the work of annotators worldwide: http://genome.ucsc.edu/
Ensembl genome browser: genome databases for vertebrates and other eukaryotic species:http://ensembl.org
Roadmap Epigenome browser: public resource of human epigenomic data: http://www.epigenomebrowser.org http://genomebrowser.wustl.edu/ http://epigenomegateway.wustl.edu/
“Sashimi plots” for RNA-Seq: http://genes.mit.edu/burgelab/miso/docs/sashimi.html
Circos: designed for visualizing genomic data in a cirlce: http://circos.ca/
SeqMonk: a tool to visualise and analyse high throughput mapped sequence data: http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/